7BUIS008W Data Mining & Machine Learning - Coursework 2

Andrew Keats

8th January 2017

Table of Contents

Task 1: Data Set Selection and Visualisation

Premise

You need to select a data set of your own choice (i.e. you may use a dataset already used in the lab, or from the literature review) for the purposes of building, training and validating the above types of classifiers 1-3. With the aid of an R package, visualise and justify the properties of the selected data set.

Picking the data set

The first task was to find out where to look for datasets. Most sources of advice pointed to the UCI repository; other sources often reuse this data and host it in a new format, such as Kaggle. After looking online for suitable datasets, specifically for the problem of classification, two datasets were possible candidates for study: the Taiwanese Credit Card dataset and the Abalone shellfish dataset. I opted for the Abalone dataset on the basis that it has fewer variables and fewer instances, so it should be easier to work with and require less preparation; the credit card data would demand more considered cleansing of outliers, potential pruning of the variables considered, and possibly PCA for dimensionality reduction.

The Abalone dataset has a little over 4,000 instances and 8 attributes, compared to 30,000 instances and 24 attributes for the Credit Card dataset. The lower dimensionality of the Abalone dataset should mean less data cleansing, because the variables are concrete physical characteristics of the animals:

  • Sex
  • Length
  • Diameter
  • Height
  • Whole weight
  • Shucked weight
  • Viscera weight
  • Shell weight
  • Rings

How to classify this data

The typical way to classify this data is to estimate approximate age from the ring count. According to the UCI page for this dataset, the rings roughly indicate age such that y = r + 1.5, where y is the age in years and r is the number of rings. The simplest approach is to attempt to predict the correct number of rings for a test instance based on the other attributes.
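The ring-to-age relationship can be sketched as a tiny helper (a hypothetical convenience function, using the offset the UCI page quotes; treat the exact constant as an assumption to verify against the source):

```r
# Approximate age from ring count (offset per the UCI dataset notes;
# the constant is an assumption to check against the source page).
rings_to_age <- function(rings, offset = 1.5) {
  stopifnot(is.numeric(rings), all(rings >= 0))
  rings + offset
}

rings_to_age(c(7, 9, 15))  # 8.5 10.5 16.5
```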

Getting the data

Having downloaded the data from the UCI repository, we need to convert the data into a table and check it is usable.

abalone_data = data.frame(read.table("../assets/data/abalone.data", sep = ","))
abalone_attr_names = c("sex", "length", "diameter", "height", "weight.whole", "weight.shucked", "weight.viscera", "weight.shell", "rings")
colnames(abalone_data) = abalone_attr_names

#We need to check there are no missing values; if any instances are incomplete we will need to remove them
missing_vals = sum(is.na(abalone_data))
print(c("Number of missing values:", missing_vals), quote = FALSE)
## [1] Number of missing values: 0

Here’s a glance at the dataset.

head(abalone_data)
##   sex length diameter height weight.whole weight.shucked weight.viscera
## 1   M  0.455    0.365  0.095       0.5140         0.2245         0.1010
## 2   M  0.350    0.265  0.090       0.2255         0.0995         0.0485
## 3   F  0.530    0.420  0.135       0.6770         0.2565         0.1415
## 4   M  0.440    0.365  0.125       0.5160         0.2155         0.1140
## 5   I  0.330    0.255  0.080       0.2050         0.0895         0.0395
## 6   I  0.425    0.300  0.095       0.3515         0.1410         0.0775
##   weight.shell rings
## 1        0.150    15
## 2        0.070     7
## 3        0.210     9
## 4        0.155    10
## 5        0.055     7
## 6        0.120     8
abalone_summary = summary(abalone_data)

grid.table(abalone_summary)

kable(abalone_summary)
|sex    |length        |diameter       |height         |weight.whole   |weight.shucked |weight.viscera |weight.shell   |rings          |
|:------|:-------------|:--------------|:--------------|:--------------|:--------------|:--------------|:--------------|:--------------|
|F:1307 |Min.   :0.075 |Min.   :0.0550 |Min.   :0.0000 |Min.   :0.0020 |Min.   :0.0010 |Min.   :0.0005 |Min.   :0.0015 |Min.   : 1.000 |
|I:1342 |1st Qu.:0.450 |1st Qu.:0.3500 |1st Qu.:0.1150 |1st Qu.:0.4415 |1st Qu.:0.1860 |1st Qu.:0.0935 |1st Qu.:0.1300 |1st Qu.: 8.000 |
|M:1528 |Median :0.545 |Median :0.4250 |Median :0.1400 |Median :0.7995 |Median :0.3360 |Median :0.1710 |Median :0.2340 |Median : 9.000 |
|NA     |Mean   :0.524 |Mean   :0.4079 |Mean   :0.1395 |Mean   :0.8287 |Mean   :0.3594 |Mean   :0.1806 |Mean   :0.2388 |Mean   : 9.934 |
|NA     |3rd Qu.:0.615 |3rd Qu.:0.4800 |3rd Qu.:0.1650 |3rd Qu.:1.1530 |3rd Qu.:0.5020 |3rd Qu.:0.2530 |3rd Qu.:0.3290 |3rd Qu.:11.000 |
|NA     |Max.   :0.815 |Max.   :0.6500 |Max.   :1.1300 |Max.   :2.8255 |Max.   :1.4880 |Max.   :0.7600 |Max.   :1.0050 |Max.   :29.000 |
The attributes explained

The Abalone is a type of shellfish and each attribute describes a characteristic of the animal. Not being an expert on this animal, I have had to research the Abalone a little in order to get an idea of what these traits are; knowing more about these properties should lead to a better understanding of the data. What follows is a list of definitions for each column.

sex
M (male), F (female), or I (infant). Presumably, an infant Abalone lacks a sex or it’s too hard to identify
length
Longest shell measurement, in millimetres
diameter
Perpendicular measurement to measured length, in millimetres
height
Height of abalone, including body, in millimetres
weight.whole
Weight of whole abalone, in grams
weight.shucked
Weight of meat alone, in grams
weight.viscera
Weight of the gut and other non-meat material alone, in grams
weight.shell
Weight of shell alone, after drying, in grams
rings
Number of rings in the shell, strongly correlated with age but offset by a constant

A little bit of basic data cleansing

The data was checked for missing values, but it is also worth checking whether some values are technically impossible and therefore belong to incorrect entries that need to be removed. To do this, all values are iterated over to check for assignments of 0, which indicate bad data.

abalone_numeric_attr <- abalone_attr_names[which(abalone_attr_names!="sex")] 

for (abalone_attr in abalone_numeric_attr)
{
  bad_vals = abalone_data[abalone_data[abalone_attr] == 0, ]
  #col_data = unlist(abalone_data[abalone_attr])
  if (nrow(bad_vals) > 0)
  {
    print(abalone_attr)
    print(abalone_data[abalone_data[abalone_attr] == 0, ])
  }

}
## [1] "height"
##      sex length diameter height weight.whole weight.shucked weight.viscera
## 1258   I  0.430     0.34      0        0.428         0.2065         0.0860
## 3997   I  0.315     0.23      0        0.134         0.0575         0.0285
##      weight.shell rings
## 1258       0.1150     8
## 3997       0.3505     6
rm(bad_vals)

It appears that there are two height values that are incorrect, so we should remove these instances from our dataset.

abalone_data$height[abalone_data$height==0] = NA
abalone_data = na.omit(abalone_data)

It’s also worth examining the weight data, to ensure that the whole weight is not less than the sum of the constituent weight values.

abalone_data_bad_weight = abalone_data[(abalone_data$weight.whole - (abalone_data$weight.shucked + abalone_data$weight.viscera + abalone_data$weight.shell)) < 0,]

head(abalone_data_bad_weight)
##    sex length diameter height weight.whole weight.shucked weight.viscera
## 43   I  0.240    0.175  0.045       0.0700         0.0315         0.0235
## 44   I  0.205    0.150  0.055       0.0420         0.0255         0.0150
## 45   I  0.210    0.150  0.050       0.0420         0.0175         0.0125
## 46   I  0.390    0.295  0.095       0.2030         0.0875         0.0450
## 47   M  0.470    0.370  0.120       0.5795         0.2930         0.2270
## 61   M  0.450    0.345  0.105       0.4115         0.1800         0.1125
##    weight.shell rings
## 43        0.020     5
## 44        0.012     5
## 45        0.015     4
## 46        0.075     7
## 47        0.140     9
## 61        0.135     7
print(paste(c("Number of instances where total weight is less than constituent weights:", nrow(abalone_data_bad_weight)), sep = ""), quote = FALSE)
## [1] Number of instances where total weight is less than constituent weights:
## [2] 154

It would appear that the data isn’t completely sanitised, so this sort of erroneous data entry needs to be handled as well. The simplest option is to rule out these instances too.

abalone_data <- abalone_data[!(abalone_data$weight.whole - (abalone_data$weight.shucked + abalone_data$weight.viscera + abalone_data$weight.shell)) < 0,]

A graphical look at the attributes

Below is an examination of each attribute to see if there are any obvious outliers that might need to be removed.

#boxplot(scale(abalone_data), main="Looking at the data graphically", xlab="Abalone Attributes", ylab="Values") 

plot(abalone_data$sex)

for (abalone_attr in abalone_numeric_attr)
{
  #print(abalone_attr)
  col_data = unlist(abalone_data[abalone_attr])
  #print(col_data)
  plot(density(col_data), xlab=abalone_attr, main=paste(c("Density of ", abalone_attr), collapse = ''))
}

rm(abalone_attr)
rm(col_data)

In order to work out which attributes should be considered to have genuine outliers, I’ve gone with a heuristic approach: looking at the distance between the uppermost outlier for each attribute and its nearest neighbour.

#Create a list to populate with our tail neighbour distances
tail_deltas <- c()

abalone_data_no_sex = abalone_data[, -1]

for (attrib in abalone_data_no_sex) {
 #get the last two values
 data_tails <- tail(sort(attrib),2)
 #push the delta on to our list 
 tail_deltas <- c(tail_deltas, diff(data_tails))
}

#grab our attribute keys to include in our new table/frame
attributes <- names(abalone_data_no_sex)

#make a new dataframe from the attribute names and tail deltas
dataframe <- data.frame(attributes = attributes, tail_neighbour_d=tail_deltas)

#get the order for the nearest neighbour starting with the greatest distance and descending
neighbour_order <- order(dataframe$tail_neighbour_d, decreasing=TRUE)

#now apply the order to the frame
sorted_attributes_by_neighbour_d <- dataframe[ neighbour_order, ]
sorted_attributes_by_neighbour_d
##       attributes tail_neighbour_d
## 8          rings           2.0000
## 3         height           0.6150
## 5 weight.shucked           0.1395
## 6 weight.viscera           0.1185
## 7   weight.shell           0.1080
## 4   weight.whole           0.0460
## 2       diameter           0.0200
## 1         length           0.0150

While rings is at the top of the leader-board regarding the delta, it’s important to take into account that this data isn’t scaled; it’s very likely that this outlier is the result of a particularly lucky Abalone that managed to live longer than the rest of the long tail. Given that the other values are all in grams and millimetres, it’s reasonable to compare them like for like in this instance. As such, the weight deltas are comparable and can be excluded from outlier cleansing, with the same applying to diameter and length; height, however, seems anomalous and will need further attention.

It’s easier to see on a box-plot that one value lies far from any neighbour, with its nearest neighbour in turn also having no nearby neighbour; we could probably benefit from simply removing the two farthest outliers.

boxplot(abalone_data$height)

abalone_data_cleansed <- abalone_data[ abalone_data$height < .5, ]
boxplot(abalone_data_cleansed$height)

Correlation between the ‘rings’ attribute and attributes pertaining to length or mass

Since the intent is to determine whether classification can estimate approximate age from physical attributes (aside from rings), it’s worth looking for existing correlations between those attributes and the number of rings.

  qplot(x = length, 
      y = rings, 
      data = abalone_data_cleansed,
      alpha = I(0.2), # alpha to help convey density
      geom = "jitter") + # jitter so points don't stack so much
  geom_smooth(method = lm)

  qplot(x = diameter, 
      y = rings, 
      data = abalone_data_cleansed,
      alpha = I(0.2), # alpha to help convey density
      geom = "jitter") + # jitter so points don't stack so much
  geom_smooth(method = lm)

  qplot(x = height, 
      y = rings, 
      data = abalone_data_cleansed,
      alpha = I(0.2), # alpha to help convey density
      geom = "jitter") + # jitter so points don't stack so much
  geom_smooth(method = lm)

  qplot(x = weight.whole, 
      y = rings, 
      data = abalone_data_cleansed,
      alpha = I(0.2), # alpha to help convey density
      geom = "jitter") + # jitter so points don't stack so much
  geom_smooth(method = lm)

  qplot(x = weight.shucked, 
      y = rings, 
      data = abalone_data_cleansed,
      alpha = I(0.2), # alpha to help convey density
      geom = "jitter") + # jitter so points don't stack so much
  geom_smooth(method = lm)

  qplot(x = weight.viscera, 
      y = rings, 
      data = abalone_data_cleansed,
      alpha = I(0.2), # alpha to help convey density
      geom = "jitter") + # jitter so points don't stack so much
  geom_smooth(method = lm)

  qplot(x = weight.shell, 
      y = rings, 
      data = abalone_data_cleansed,
      alpha = I(0.2), # alpha to help convey density
      geom = "jitter") + # jitter so points don't stack so much
  geom_smooth(method = lm)

Picking data attributes to select

With our cleansed data we can see an evident correlation between all of these attributes and the number of rings. In particular, those relating to physical size show the strongest relationship, as their points best match the line of best fit; the weight values curve away from the line as both axes approach 0 and diverge from it further as the values increase. Should any attributes be selected to work with, discarding the others, they would be:

  • length
  • diameter
  • height
  • weight.whole

Given that all attributes appear to display a fairly linear relationship to the ring count, we can use r-squared, otherwise known as the coefficient of determination, to verify how well the data matches the line of best fit.

abalone.lm_length <- lm(data = abalone_data_cleansed, formula = rings ~ length)
abalone.lm_diameter <- lm(data = abalone_data_cleansed, formula = rings ~ diameter)
abalone.lm_height <- lm(data = abalone_data_cleansed, formula = rings ~ height)
abalone.lm_weight.whole <- lm(data = abalone_data_cleansed, formula = rings ~ weight.whole)
abalone.lm_weight.shucked <- lm(data = abalone_data_cleansed, formula = rings ~ weight.shucked)
abalone.lm_weight.viscera <- lm(data = abalone_data_cleansed, formula = rings ~ weight.viscera)
abalone.lm_weight.shell <- lm(data = abalone_data_cleansed, formula = rings ~ weight.shell)

abalone.r_squareds <- c(
  summary(abalone.lm_length)$adj.r.squared,
  summary(abalone.lm_diameter)$adj.r.squared,
  summary(abalone.lm_height)$adj.r.squared,
  summary(abalone.lm_weight.whole)$adj.r.squared,
  summary(abalone.lm_weight.shucked)$adj.r.squared,
  summary(abalone.lm_weight.viscera)$adj.r.squared,
  summary(abalone.lm_weight.shell)$adj.r.squared
)

abalone_numeric_attr_no_rings <- abalone_numeric_attr[which(abalone_numeric_attr!="rings")] 

#make a new dataframe from the attribute names and r-squared values
dataframe.rsquareds <- data.frame(attributes = abalone_numeric_attr_no_rings, r_squared=abalone.r_squareds)

#get the order for the r-squared values, starting with the greatest and descending
rsquareds_order <- order(dataframe.rsquareds$r_squared, decreasing=TRUE)

#now apply the order to the frame
sorted_attributes_by_r_squared <- dataframe.rsquareds[ rsquareds_order, ]
sorted_attributes_by_r_squared
##       attributes r_squared
## 7   weight.shell 0.3855553
## 3         height 0.3604153
## 2       diameter 0.3155304
## 1         length 0.2950611
## 4   weight.whole 0.2806291
## 6 weight.viscera 0.2439019
## 5 weight.shucked 0.1681050
#abalone.lm_length

After looking at these results it may be wiser to consider using the shell weight alone if necessary; the r-squared value is not a perfect measure of fit, but in this case it could make more sense to use a weight value that fluctuates less in nature. The selection of the attributes relating to size appears to be validated.

Substantiating the selection

With the use of a correlation matrix, it’s possible to further validate which attributes could be worth focusing on:

# calculate correlation matrix
correlationMatrix <- cor(abalone_data_cleansed[,2:9])
# summarize the correlation matrix
print(correlationMatrix)
##                   length  diameter    height weight.whole weight.shucked
## length         1.0000000 0.9861689 0.8974579    0.9259423      0.9003873
## diameter       0.9861689 1.0000000 0.9040452    0.9260010      0.8952472
## height         0.8974579 0.9040452 1.0000000    0.8873269      0.8361649
## weight.whole   0.9259423 0.9260010 0.8873269    1.0000000      0.9709880
## weight.shucked 0.9003873 0.8952472 0.8361649    0.9709880      1.0000000
## weight.viscera 0.9034261 0.8997239 0.8656499    0.9670606      0.9323660
## weight.shell   0.8977314 0.9055333 0.8897232    0.9561399      0.8830020
## rings          0.5433567 0.5618726 0.6004786    0.5299133      0.4102585
##                weight.viscera weight.shell     rings
## length              0.9034261    0.8977314 0.5433567
## diameter            0.8997239    0.9055333 0.5618726
## height              0.8656499    0.8897232 0.6004786
## weight.whole        0.9670606    0.9561399 0.5299133
## weight.shucked      0.9323660    0.8830020 0.4102585
## weight.viscera      1.0000000    0.9070347 0.4940547
## weight.shell        0.9070347    1.0000000 0.6210541
## rings               0.4940547    0.6210541 1.0000000
# find attributes that are highly correlated (here using a cutoff of 0.5)
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.5)

abalone.correlation <- data.frame(correlation=correlationMatrix[,8])
#get rid of rings; that's obviously going to have maximum correlation with itself!
abalone.correlation <- data.frame(attributes=rownames(abalone.correlation)[1:7], 'correlation to rings'=abalone.correlation[1:7, 0:1])

#get the order for correlations
correlation_order <- order(abalone.correlation$correlation, decreasing=TRUE)

#now apply the order to the frame
sorted_correlation_order <- abalone.correlation[ correlation_order, 1:2]
print(sorted_correlation_order)
##       attributes correlation.to.rings
## 7   weight.shell            0.6210541
## 3         height            0.6004786
## 2       diameter            0.5618726
## 1         length            0.5433567
## 4   weight.whole            0.5299133
## 6 weight.viscera            0.4940547
## 5 weight.shucked            0.4102585

The correlation between the ring value and the other attributes, when ordered, exactly matches the order of the r-squared values; weight.shell should be considered the primary correlate of them all, followed by height.
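This ordering match is no coincidence: for a single-predictor linear model, R-squared is exactly the square of the Pearson correlation between predictor and response (the adjusted value differs only by a small sample-size penalty). A quick check on R’s built-in cars data, rather than the abalone data, illustrates this:

```r
# For simple linear regression, multiple R-squared equals cor(x, y)^2;
# demonstrated here on the built-in cars dataset.
fit <- lm(dist ~ speed, data = cars)
r2 <- summary(fit)$r.squared
r  <- cor(cars$speed, cars$dist)
all.equal(r2, r^2)  # TRUE
```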

Creating a simpler classification value

Given that the number of rings is a range of integers from 1 to 29, using each individual possible integer as a class presents a problem. Firstly, as the ring value cannot be considered continuous, it is not really sensible to treat it as such, so converting it to a floating-point range is not appropriate; in addition, the dataset might not contain instances covering every possible ring value between 1 and 29, which will cause inaccuracy in some of the models.

An alternative dataset can be created that mitigates this issue by creating an approximate age-group factor; classification of the Abalone can then be done more loosely against age ranges, such as young, middle and old, with test instances still being compared by the other attributes and mapped to one of these age-group factor values. The split will be determined by the ring/age spread of the sampled population of Abalones, such that roughly a third of the sample is in each age group. This is because the number of Abalones with more than 12 rings starts to drop off quite acutely, and as a percentage those above 15 are a small minority, despite the highest values nearing 30 rings.

summary(abalone_data_cleansed$rings)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2       8      10      10      11      29
quantile(abalone_data_cleansed$rings, c(1/3, 2/3)) 
## 33.33333% 66.66667% 
##         9        11
abalone_age_groups <- cut(abalone_data_cleansed$rings, breaks=c(-Inf, 9, 12, Inf), labels=c("young","middle","old"))

summary(abalone_age_groups)
##  young middle    old 
##   1978   1357    684
abalone_data_cleansed_age_groups <- subset(abalone_data_cleansed, select=-rings)
abalone_data_cleansed_age_groups$age_group <- abalone_age_groups
rm(abalone_age_groups)

Task 2: Formation of Training and Test Sets

Premise

Assuming we have collected one large dataset of already-classified instances, you need to look at three methods of forming training and test sets from this single dataset in R as described below.

  • The holdout method
  • Cross-validation
  • Leave-one-out cross-validation (Jack Knife)

Holdout datasets

The holdout method is the most basic separation of the dataset into training and testing data. This method creates just one division, albeit selecting instances randomly from across the dataset rather than in a linear fashion.

set.seed(4321)
head(abalone_data_cleansed)
##   sex length diameter height weight.whole weight.shucked weight.viscera
## 1   M  0.455    0.365  0.095       0.5140         0.2245         0.1010
## 2   M  0.350    0.265  0.090       0.2255         0.0995         0.0485
## 3   F  0.530    0.420  0.135       0.6770         0.2565         0.1415
## 4   M  0.440    0.365  0.125       0.5160         0.2155         0.1140
## 5   I  0.330    0.255  0.080       0.2050         0.0895         0.0395
## 6   I  0.425    0.300  0.095       0.3515         0.1410         0.0775
##   weight.shell rings
## 1        0.150    15
## 2        0.070     7
## 3        0.210     9
## 4        0.155    10
## 5        0.055     7
## 6        0.120     8
holdout.train_indeces <- createDataPartition(y = abalone_data_cleansed$weight.shell, p= 2/3, list = FALSE)
head(holdout.train_indeces, n=20)
##       Resample1
##  [1,]         1
##  [2,]         2
##  [3,]         3
##  [4,]         4
##  [5,]         6
##  [6,]         7
##  [7,]         8
##  [8,]         9
##  [9,]        10
## [10,]        13
## [11,]        15
## [12,]        16
## [13,]        20
## [14,]        21
## [15,]        22
## [16,]        23
## [17,]        24
## [18,]        25
## [19,]        26
## [20,]        28
holdout.training <- abalone_data_cleansed[holdout.train_indeces,]

save(holdout.training, file="holdout.training.rda")

holdout.testing <- abalone_data_cleansed[-holdout.train_indeces,]

save(holdout.testing, file="holdout.testing.rda")

Evaluating the holdout datasets

By looking at the dimensionality it is possible to confirm that the new datasets are of the correct size.

dim(holdout.training)
## [1] 2681    9
dim(holdout.testing)
## [1] 1338    9

Holdout set for age group factor dataset

As an addition, the dataset modified to have an age_group factor attribute in place of the integer rings attribute will also be partitioned for use with the holdout method.

head(abalone_data_cleansed_age_groups)
##   sex length diameter height weight.whole weight.shucked weight.viscera
## 1   M  0.455    0.365  0.095       0.5140         0.2245         0.1010
## 2   M  0.350    0.265  0.090       0.2255         0.0995         0.0485
## 3   F  0.530    0.420  0.135       0.6770         0.2565         0.1415
## 4   M  0.440    0.365  0.125       0.5160         0.2155         0.1140
## 5   I  0.330    0.255  0.080       0.2050         0.0895         0.0395
## 6   I  0.425    0.300  0.095       0.3515         0.1410         0.0775
##   weight.shell age_group
## 1        0.150       old
## 2        0.070     young
## 3        0.210     young
## 4        0.155    middle
## 5        0.055     young
## 6        0.120     young
holdout_age_groups.train_indeces <- createDataPartition(y = abalone_data_cleansed_age_groups$weight.shell, p= 2/3, list = FALSE)
head(holdout_age_groups.train_indeces, n=20)
##       Resample1
##  [1,]         1
##  [2,]         2
##  [3,]         3
##  [4,]         4
##  [5,]         6
##  [6,]         8
##  [7,]         9
##  [8,]        12
##  [9,]        13
## [10,]        14
## [11,]        15
## [12,]        16
## [13,]        17
## [14,]        19
## [15,]        21
## [16,]        22
## [17,]        23
## [18,]        24
## [19,]        25
## [20,]        26
holdout_age_groups.training <- abalone_data_cleansed_age_groups[holdout_age_groups.train_indeces,]

holdout_age_groups.testing <- abalone_data_cleansed_age_groups[-holdout_age_groups.train_indeces,]

Cross-validation datasets

Cross-validation methods are variations on the holdout method. k-fold cross-validation in particular is an extended holdout method whereby the dataset is chunked into smaller fragments called ‘folds’ (where k is the fold count), each of which is used in turn as the test subset while the remaining folds make up the training subset. In this way, training is carried out several times over the same dataset, rotating the role of the folds such that every instance is used several times as training data and once as test data.

This form of training makes better use of a small sample size and helps even out any biases that might occur from just taking one partition for training and another for testing. This averaging out of the training and testing, also happens to benefit larger datasets too, so it is generally considered superior to the basic holdout method.
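The mechanics of fold formation can be sketched in base R with a hypothetical `make_folds` helper (the caret training controls used later do this internally, so this is purely illustrative):

```r
# Minimal k-fold partitioning sketch: shuffle the row indices, then split
# them into k roughly equal folds; each fold serves once as the test set.
make_folds <- function(n, k, seed = 4321) {
  set.seed(seed)
  split(sample(n), rep_len(1:k, n))
}

folds <- make_folds(n = 100, k = 5)
for (i in seq_along(folds)) {
  test_idx  <- folds[[i]]
  train_idx <- setdiff(1:100, test_idx)
  # a model would be fitted on train_idx and scored on test_idx here
}
sapply(folds, length)  # five folds of 20 indices each
```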

Repeated k-fold Cross-validation

Repeated k-fold cross-validation takes the technique a step further by splitting the dataset into k folds repeatedly, so that a different fragmentation occurs each time; even though the number of divisions is the same, each repetition creates a different set of subsets. This shuffling reuses the dataset for training purposes even more economically.

For the sake of evaluating this method more thoroughly, the dataset will be used with k-fold cross-validation using the Fibonacci numbers 8 and 13 for k and 5 for the number of repetitions. For comparison, a standard (unrepeated) k-fold will also be used.

cv.train_control_8 <- trainControl(method="cv", number=8)
cv.train_control_8_5 <- trainControl(method="repeatedcv", number=8, repeats = 5)
cv.train_control_13_5 <- trainControl(method="repeatedcv", number=13, repeats = 5)

These training controls will be used later on in this study to train the various models for the task of classification but for now they are merely abstract instructions on how to chunk the data.

Leave-one-out cross-validation

Leave-one-out cross-validation is a form of exhaustive cross-validation. Exhaustive cross-validation methods are said to “learn and test on all possible ways to divide the original sample into a training and a validation set” [CITE HERE!]. The leave-one-out method is a specific form of the leave-p-out technique where, instead of determining the test dataset as a fraction of the whole as with k-fold, p is the absolute count of instances to be used in the test subset. What makes this technique exhaustive is that the p-subset is iterated such that every instance is included in at least one test subset.

Leave-p-out can be computationally expensive: for a sample of n instances, every possible test subset of size p must be formed, giving C(n, p) training and testing rounds. This is why leave-one-out might be preferred, since a test set of one means there need only be as many test and training sets as there are instances.
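The cost difference can be made concrete with R's choose function: leave-p-out must fit one model per size-p subset, i.e. C(n, p) of them, while leave-one-out needs only n:

```r
# Number of models required for a hypothetical sample of n = 40 instances
n <- 40
choose(n, 1)  # leave-one-out:  40 models
choose(n, 2)  # leave-2-out:    780 models
choose(n, 5)  # leave-5-out:    658008 models
```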

# define training control
loocv.train_control <- trainControl(method="LOOCV")

As with the k-folds training controls, the leave-one-out control will be used later on but for now remains an abstract set of instructions on how to chunk the data.

Task 3: Build Train and Test a Decision Tree type Classifier

Premise

You need to construct, train and test a Decision Tree type classifier (C4.5, Random Forest) in R. Train and test your decision tree classifier using the training and test sets generated by the methods tried as part of the 2nd Task.

C4.5 Decision Tree

Searching for the C4.5 method as an existing library within R brought up more than one option; both appear to use an open-source equivalent of C4.5 rather than an official C4.5 implementation. Another option was to use C5.0, which supersedes C4.5, but since the task was to investigate C4.5, J48 has been chosen as a more faithful example of the algorithm.

Below is not only an investigation into C4.5 but also a comparison of two variations. Initially, all available dimensions will be used for training.

The J48 method

load("holdout.training.rda")
load("holdout.testing.rda")


dt.c4_5_j48_h <- J48(as.factor(rings)~., holdout.training) 
dt_sum.c4_5_j48_h <- summary(dt.c4_5_j48_h)
dt.c4_5_j48_h_party <- as.party(dt.c4_5_j48_h)

The J48 function is extremely quick with this dataset; training takes less than a second, which on its own is not necessarily worthy of note but is certainly more interesting when compared to the speed of using the caret train method with a “J48” method argument value.

We can look at the complexity of the tree through its dimensionality, where the length is effectively the tree size, the width is the number of leaves (terminal nodes), and the depth is effectively the number of conditional branch layers.

length(dt.c4_5_j48_h_party)
## [1] 1534
width(dt.c4_5_j48_h_party)
## [1] 784
depth(dt.c4_5_j48_h_party)
## [1] 22

This tree is sufficiently complex to render a graphical representation useless. In fact, the tree is so complex that plotting it is time-intensive as well as being of no value.

The caret train J48 argument

dt.c4_5_h2 <- train(as.factor(rings) ~., method="J48", holdout.training, tuneLength = 8)
dt_sum.c4_5_h2 <- summary(dt.c4_5_h2)
dt.c4_5_h2_finalModel <- dt.c4_5_h2$finalModel
dt_sum.c4_5_h2_final <- summary(dt.c4_5_h2_finalModel)

Whilst, in comparison to other forms of training and data-mining algorithms, under 5 minutes for a 4,000-by-9 dataset might seem acceptable, compared to the plain J48 function something seems at odds. The proof will be in the comparison of the two models' accuracy on the test dataset.

Again, we can look at the complexity of the tree through its dimensionality: length being total size, width being the leaf-node count and depth being the branch-layer count.

#dt.c4_5_h2
#dt.c4_5_h2_finalModel <- dt.c4_5_h2$finalModel
#dt.c4_5_h2_finalModel
dt.c4_5_h2_finalModel_party <- as.party(dt.c4_5_h2_finalModel)
#dt.c4_5_h2_finalModel_party <- as.party(dt.c4_5_h2$finalModel)
length(dt.c4_5_h2_finalModel_party)
## [1] 89
width(dt.c4_5_h2_finalModel_party)
## [1] 45
depth(dt.c4_5_h2_finalModel_party)
## [1] 12

It would appear that the ‘final model’ derived from the train function has already been pruned, making for a simpler decision tree; this tree can actually be rendered graphically in a reasonable amount of time (under 10 seconds).

#plot(dt.c4_5_h2_finalModel)
plot(dt.c4_5_h2_finalModel_party)

Even though the tree can be rendered, it is not easy to get anything meaningful out of it. Perhaps the most pertinent point is that not only could the classification levels benefit from being simplified, but using fewer dimensions for observation would also force a simpler set of conditional branches.

Comparing the two C4.5/J48 methods

Below follows the summaries from both methods, for examination:

dt_sum.c4_5_j48_h
## 
## === Summary ===
## 
## Correctly Classified Instances        2033               75.8299 %
## Incorrectly Classified Instances       648               24.1701 %
## Kappa statistic                          0.7279
## Mean absolute error                      0.0234
## Root mean squared error                  0.1081
## Relative absolute error                 35.265  %
## Root relative squared error             59.4069 %
## Total Number of Instances             2681     
## 
## === Confusion Matrix ===
## 
##    a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t   u   v   w   x   y   z  aa   <-- classified as
##    0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 |   a = 2
##    0   5   2   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 |   b = 3
##    0   1  31   0   2   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 |   c = 4
##    0   1   4  57   7   4   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 |   d = 5
##    0   0   3   5 133   7   2   1   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 |   e = 6
##    0   0   1   3  12 203  12   5   5   2   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 |   f = 7
##    0   0   2   1   8  22 294  25  12   6   0   1   2   1   0   0   0   0   0   0   0   0   0   0   0   0   0 |   g = 8
##    0   0   0   3   4   6  21 363  13  21   2   2   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0 |   h = 9
##    0   0   0   0   2   4  16  31 342  11   5   5   1   0   1   0   0   0   0   0   0   0   0   0   0   0   0 |   i = 10
##    0   0   0   0   3   7  14  13  18 260   5   2   1   1   1   1   0   0   0   0   0   1   0   0   0   0   0 |   j = 11
##    0   0   0   0   0   1   9  11  15   9 114   3   0   1   0   1   1   0   0   0   0   0   0   0   0   0   0 |   k = 12
##    0   0   0   0   1   1   4   8  11  12   3  85   1   1   1   0   0   0   1   1   0   0   0   0   0   0   0 |   l = 13
##    0   0   0   0   0   4   5   6  11   8   6   4  42   1   2   2   0   0   0   0   0   0   0   0   0   0   0 |   m = 14
##    0   0   0   0   0   0   4   4   7   4   5   4   2  29   0   0   0   0   0   1   0   0   0   0   0   0   0 |   n = 15
##    0   0   0   0   0   0   1   6   3   3   1   1   3   1  26   0   1   0   0   0   0   0   0   0   0   0   0 |   o = 16
##    0   0   0   0   0   0   2   4   1   0   2   1   3   1   0  23   0   0   0   0   0   0   0   0   0   0   0 |   p = 17
##    0   0   0   0   0   1   2   1   5   4   1   1   2   2   0   0  10   0   0   0   0   0   0   0   0   0   0 |   q = 18
##    0   0   0   0   1   0   1   1   2   3   1   3   1   0   1   1   0   1   0   0   0   0   0   0   0   0   0 |   r = 19
##    0   0   0   0   0   1   0   0   1   1   2   1   0   0   0   1   0   0   7   0   0   0   0   0   0   0   0 |   s = 20
##    0   0   0   0   0   0   0   0   0   3   1   0   1   1   1   0   0   0   0   6   0   0   0   0   0   0   0 |   t = 21
##    0   0   0   0   0   0   0   0   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 |   u = 22
##    0   0   0   0   0   0   1   0   0   0   0   0   0   1   1   0   0   1   0   0   0   2   0   0   0   0   0 |   v = 23
##    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   1   0   0   0   0   0   0   0 |   w = 24
##    0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0 |   x = 25
##    0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0 |   y = 26
##    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0 |   z = 27
##    0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 |  aa = 29
dt_sum.c4_5_h2
## 
## === Summary ===
## 
## Correctly Classified Instances         904               33.7188 %
## Incorrectly Classified Instances      1777               66.2812 %
## Kappa statistic                          0.2414
## Mean absolute error                      0.0578
## Root mean squared error                  0.1699
## Relative absolute error                 87.138  %
## Root relative squared error             93.3833 %
## Total Number of Instances             2681     
## 
## === Confusion Matrix ===
## 
##    a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t   u   v   w   x   y   z  aa   <-- classified as
##    0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 |   a = 2
##    0   0   7   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 |   b = 3
##    0   0  20  11   0   3   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 |   c = 4
##    0   0   9  40   0  21   0   3   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 |   d = 5
##    0   0   0  34   0  85  20  10   3   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 |   e = 6
##    0   0   0   9   0 113  86  26   7   1   1   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0 |   f = 7
##    0   0   0   5   0  48 214  53  45   4   4   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0 |   g = 8
##    0   0   0   0   0  14 134 122 137  26   1   0   0   0   0   1   0   0   0   0   0   1   0   0   0   0   0 |   h = 9
##    0   0   0   1   0   8  69  42 225  61   5   3   2   0   2   0   0   0   0   0   0   0   0   0   0   0   0 |   i = 10
##    0   0   0   0   0   7  42  26 138 102   5   3   1   0   2   1   0   0   0   0   0   0   0   0   0   0   0 |   j = 11
##    0   0   0   0   0   0  30  14  48  45  24   0   0   0   2   1   0   1   0   0   0   0   0   0   0   0   0 |   k = 12
##    0   0   0   0   0   0  25  16  31  33   6  12   3   0   2   1   0   0   0   0   0   1   0   0   0   0   0 |   l = 13
##    0   0   0   0   0   1  16   5  27  23   2   4  12   0   0   0   0   0   0   0   0   1   0   0   0   0   0 |   m = 14
##    0   0   0   0   0   0  12   7  15  15   4   0   3   0   2   1   0   1   0   0   0   0   0   0   0   0   0 |   n = 15
##    0   0   0   0   0   0   7   4   2  18   2   0   1   0  12   0   0   0   0   0   0   0   0   0   0   0   0 |   o = 16
##    0   0   0   0   0   0   2   0   8  11   4   4   4   0   0   4   0   0   0   0   0   0   0   0   0   0   0 |   p = 17
##    0   0   0   0   0   0   1   4   6  12   1   1   0   0   3   0   0   0   0   0   0   1   0   0   0   0   0 |   q = 18
##    0   0   0   0   0   0   1   1   2   5   2   3   0   0   0   0   0   2   0   0   0   0   0   0   0   0   0 |   r = 19
##    0   0   0   0   0   0   0   2   1   6   2   2   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0 |   s = 20
##    0   0   0   0   0   0   1   0   2   7   0   0   0   0   1   0   0   2   0   0   0   0   0   0   0   0   0 |   t = 21
##    0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0 |   u = 22
##    0   0   0   0   0   0   0   0   1   1   0   1   0   0   1   0   0   0   0   0   0   2   0   0   0   0   0 |   v = 23
##    0   0   0   0   0   0   0   0   0   2   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 |   w = 24
##    0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0 |   x = 25
##    0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0 |   y = 26
##    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0 |   z = 27
##    0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 |  aa = 29

Accuracy comparison of C4.5/J48 models

Although the values are present in the summaries above, the headline accuracies are repeated below for clarity:

dt_sum.c4_5_j48_h$details[1]
## pctCorrect 
##   75.82991
dt_sum.c4_5_h2$details[1]
## pctCorrect 
##   33.71876

The J48 function reports a significantly higher training accuracy than the train function's J48 call, but at this stage it is hard to be confident this is a good thing. Given the difference in tree complexity, the J48 function may well suffer from massive over-fitting, while the train call has done some aggressive pruning that accounts for both the extra run time and the diminished training accuracy. To understand things further it is necessary to test the trees against the validation subset.

Comparing the two C4.5/J48 models after prediction

What follows is the output of testing the models against the test subset.

holdout.test_rings <- holdout.testing$rings
dt.c4_5_j48_h_test <- predict(dt.c4_5_j48_h, newdata = holdout.testing)
holdout.test_levels <- min(holdout.test_rings):max(holdout.test_rings)
dt.c4_5_j48_h_test_cm <- confusionMatrix(factor(dt.c4_5_j48_h_test, levels=holdout.test_levels), factor(holdout.test_rings, levels = holdout.test_levels))
dt.c4_5_j48_h_test_cm$table
##           Reference
## Prediction  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
##         3   0  1  2  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         4   4  5  4  3  5  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         5   1  5  5  9  8  2  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         6   0  5 13 27 18 12  6  8  3  2  0  0  1  0  0  0  0  0  0  0  0
##         7   0  0  4 31 45 30 17  8  2  2  2  0  1  0  0  0  0  0  0  0  0
##         8   0  0  1  9 20 39 47 24 17  6  5  4  2  1  3  0  0  0  0  0  0
##         9   0  0  0  3 12 39 46 40 26 21  4  6  5  0  1  0  1  0  0  0  0
##         10  0  0  0  2  8 21 50 53 35 21 13 12  8  5  2  5  4  3  0  0  0
##         11  0  0  0  3  5 15 41 28 37 20 14  3  6  2  6  2  3  1  0  0  0
##         12  0  0  0  0  2  5  9 19 16  7  8  4  2  2  2  1  0  1  1  2  1
##         13  0  0  0  0  2  3  5  9  8  6  8  1  3  7  1  2  4  1  0  0  0
##         14  0  0  0  0  0  0  0  4  1  2  3  2  4  1  0  0  0  3  0  0  1
##         15  0  0  0  0  0  1  2  5  4  2  4  0  3  0  1  1  1  1  0  0  0
##         16  0  0  0  0  0  0  1  0  0  1  4  3  2  1  2  0  1  0  0  0  0
##         17  0  0  0  1  0  1  1  0  0  2  2  0  0  0  1  0  0  0  0  1  1
##         18  0  0  0  0  0  0  0  1  2  0  1  0  2  0  1  0  1  1  0  0  0
##         19  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         20  0  0  0  0  0  0  0  1  0  0  0  0  1  0  1  1  0  1  0  0  0
##         21  0  0  0  0  0  0  0  0  0  3  1  0  0  1  0  1  0  0  0  1  0
##         22  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         23  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0
##         24  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         25  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         26  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         27  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##           Reference
## Prediction 24 25 26 27
##         3   0  0  0  0
##         4   0  0  0  0
##         5   0  0  0  0
##         6   0  0  0  0
##         7   0  0  0  0
##         8   0  0  0  0
##         9   0  0  0  0
##         10  0  0  0  0
##         11  0  0  0  0
##         12  0  0  0  0
##         13  0  0  0  0
##         14  0  0  0  0
##         15  0  0  0  0
##         16  0  0  0  1
##         17  0  0  0  0
##         18  0  0  0  0
##         19  0  0  0  0
##         20  0  0  0  0
##         21  0  0  0  0
##         22  0  0  0  0
##         23  0  0  0  0
##         24  0  0  0  0
##         25  0  0  0  0
##         26  0  0  0  0
##         27  0  0  0  0
dt.c4_5_j48_h_test_cm$overall[1]
##  Accuracy 
## 0.2092676
#dt.c4_5_h2
dt.c4_5_h2_test <- predict(dt.c4_5_h2, newdata = holdout.testing)
dt.c4_5_h2_test_cm <- confusionMatrix(factor(dt.c4_5_h2_test, levels=holdout.test_levels), factor(holdout.test_rings, levels = holdout.test_levels))
dt.c4_5_h2_test_cm$table
##           Reference
## Prediction  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
##         3   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         4   5  8  6  2  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         5   0  8 15 10 11  3  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         6   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         7   0  0  8 61 53 21  5  5  1  3  1  0  0  0  0  0  0  0  0  0  0
##         8   0  0  0 10 45 78 91 44 32  7 10  8  8  3  2  1  0  0  1  0  0
##         9   0  0  0  3 12 33 35 30 15 16  5  3  4  3  0  1  1  1  0  0  0
##         10  0  0  0  0  3 25 72 82 63 33 21 10 11  3  4  6  5  2  0  1  0
##         11  0  0  0  1  0  6 16 31 36 26 21  5 10  7 12  3  6  5  0  3  1
##         12  0  0  0  0  0  2  4  6  2  6  3  2  0  1  1  0  0  1  0  0  0
##         13  0  0  0  1  0  0  0  2  0  1  0  1  5  0  0  0  0  1  0  0  0
##         14  0  0  0  0  0  0  0  0  1  1  1  3  2  3  1  1  0  0  0  0  2
##         15  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         16  0  0  0  0  0  0  1  1  1  0  5  1  0  0  1  1  2  0  0  0  0
##         17  0  0  0  0  0  0  1  0  0  0  1  0  0  0  0  0  0  0  0  0  0
##         18  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         19  0  0  0  0  0  0  0  0  0  0  1  1  0  0  0  0  0  2  0  0  0
##         20  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         21  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         22  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         23  0  0  0  0  0  0  0  0  0  2  0  1  0  0  0  0  1  0  0  0  0
##         24  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         25  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         26  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         27  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##           Reference
## Prediction 24 25 26 27
##         3   0  0  0  0
##         4   0  0  0  0
##         5   0  0  0  0
##         6   0  0  0  0
##         7   0  0  0  0
##         8   0  0  0  0
##         9   0  0  0  0
##         10  0  0  0  0
##         11  0  0  0  1
##         12  0  0  0  0
##         13  0  0  0  0
##         14  0  0  0  0
##         15  0  0  0  0
##         16  0  0  0  0
##         17  0  0  0  0
##         18  0  0  0  0
##         19  0  0  0  0
##         20  0  0  0  0
##         21  0  0  0  0
##         22  0  0  0  0
##         23  0  0  0  0
##         24  0  0  0  0
##         25  0  0  0  0
##         26  0  0  0  0
##         27  0  0  0  0
dt.c4_5_h2_test_cm$overall[1]
##  Accuracy 
## 0.2361734

It would appear that despite the J48 function-derived model claiming a higher training accuracy, the resulting confusion matrix data suggests otherwise. Neither 23.6% nor 20.9% is particularly encouraging; however, in exchange for the improvement in speed, the loss of about 3 percentage points (roughly a 12% relative drop from 23.6%) might be reasonable if the overall accuracy can be improved. To investigate this further, the next steps are to reduce the number of attributes observed, based on the earlier analysis of the data, and to train the models towards simpler classification goals.

C4.5 with refined dataset

To push the accuracy higher, it is worth running the very same decision tree functions against a streamlined classification aim, with a more targeted formula. Using raw ring counts as the classification levels amounted to nearly 30 possible outcomes; bringing that down to a tiered factor of 3 age groups should not only make training more performant, but it is also easier to be accurate when each intended class covers a broader range of instances.
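The three-level age_group factor used below was derived from the ring counts; the exact breakpoints used to build the age-grouped dataset are not repeated here, so the cut points in this minimal base-R sketch are illustrative assumptions only:

```r
# Illustrative sketch of deriving a three-level age_group factor from rings.
# The breakpoints (<= 8 young, 9-10 middle, 11+ old) are assumptions for
# demonstration, not necessarily those used for the real age-grouped dataset.
rings <- c(3, 7, 9, 10, 12, 15, 20)

age_group <- cut(rings,
                 breaks = c(-Inf, 8, 10, Inf),
                 labels = c("young", "middle", "old"))
as.character(age_group)
```

With the real data the same cut() call would be applied to the rings column before partitioning into training and testing subsets.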

Refined formula

Previously, each of the continuous numeric attributes was analysed for its correlation with the number of rings. To obtain a simpler decision tree, it stands to reason that only the attributes most relevant to this particular classification problem should be used. To this end, the attributes selected for the revised formula are the top three correlates: weight.shell, height and diameter.
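A correlation ranking of that kind can be reproduced along these lines; the data frame below is synthetic (the column names weight.shell and height are taken from the text, noise is a made-up stand-in), so with the real data the first argument would instead be the numeric columns of the cleansed dataset:

```r
# Sketch: rank attributes by absolute correlation with rings.
# Toy deterministic values stand in for the real Abalone columns.
toy <- data.frame(
  rings        = c(1, 2, 3, 4, 5, 6),
  weight.shell = c(1.1, 2.0, 2.9, 4.2, 5.0, 6.1),  # near-linear in rings
  height       = c(2, 1, 4, 3, 6, 5),              # loosely related
  noise        = c(5, 1, 4, 2, 6, 3)               # barely related
)
cors   <- cor(toy[, -1], toy$rings)           # one correlation per attribute
ranked <- cors[order(abs(cors), decreasing = TRUE), , drop = FALSE]
rownames(ranked)                              # strongest correlates first
```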

dt.formula <- as.formula(age_group ~ weight.shell + height + diameter)

The J48 method with updated formula and simpler classification

dt.c4_5_j48_h_a <- J48(dt.formula, holdout_age_groups.training) 
dt_sum.c4_5_j48_h_a <- summary(dt.c4_5_j48_h_a)
dt.c4_5_j48_h_a_party <- as.party(dt.c4_5_j48_h_a)
#
dt_sum.c4_5_j48_h_a
## 
## === Summary ===
## 
## Correctly Classified Instances        1892               70.5707 %
## Incorrectly Classified Instances       789               29.4293 %
## Kappa statistic                          0.5053
## Mean absolute error                      0.2797
## Root mean squared error                  0.374 
## Relative absolute error                 68.3388 %
## Root relative squared error             82.6712 %
## Total Number of Instances             2681     
## 
## === Confusion Matrix ===
## 
##     a    b    c   <-- classified as
##  1084  223   14 |    a = young
##   243  630   37 |    b = middle
##    79  193  178 |    c = old
#
length(dt.c4_5_j48_h_a_party)
## [1] 103
width(dt.c4_5_j48_h_a_party)
## [1] 52
depth(dt.c4_5_j48_h_a_party)
## [1] 11
#
plot(dt.c4_5_j48_h_a_party)

The caret train J48 argument with updated formula and simpler classification

dt.c4_5_h2_a <- train(dt.formula, method="J48", holdout_age_groups.training, tuneLength = 8)
dt_sum.c4_5_h2_a <- summary(dt.c4_5_h2_a)
dt_sum.c4_5_h2_a_final <- summary(dt.c4_5_h2_a$finalModel)
#
dt_sum.c4_5_h2_a
## 
## === Summary ===
## 
## Correctly Classified Instances        1821               67.9224 %
## Incorrectly Classified Instances       860               32.0776 %
## Kappa statistic                          0.4683
## Mean absolute error                      0.2995
## Root mean squared error                  0.387 
## Relative absolute error                 73.1807 %
## Root relative squared error             85.5498 %
## Total Number of Instances             2681     
## 
## === Confusion Matrix ===
## 
##     a    b    c   <-- classified as
##  1048  229   44 |    a = young
##   234  590   86 |    b = middle
##    69  198  183 |    c = old
#
dt.c4_5_h2_a_finalModel <- dt.c4_5_h2_a$finalModel
dt.c4_5_h2_a_finalModel_party <- as.party(dt.c4_5_h2_a_finalModel)
#
length(dt.c4_5_h2_a_finalModel_party)
## [1] 33
width(dt.c4_5_h2_a_finalModel_party)
## [1] 17
depth(dt.c4_5_h2_a_finalModel_party)
## [1] 8
#
plot(dt.c4_5_h2_a_finalModel_party)

Accuracy comparison of revised C4.5/J48 models

dt_sum.c4_5_j48_h_a$details[1]
## pctCorrect 
##   70.57068
dt_sum.c4_5_h2_a$details[1]
## pctCorrect 
##   67.92242

Interestingly, the J48 function has produced a training accuracy slightly below its previous version (70.6% against 75.8%), while the train J48 call has improved dramatically (from 33.7% to 67.9%), bringing it nearly on par with the J48 function result. Validating the models against the test subset should prove even more revealing.

Comparing the revised C4.5/J48 models after prediction

What follows is the output of testing the models against the test subset.

#holdout.test_rings <- holdout.testing$rings
dt.c4_5_j48_h_a_test <- predict(dt.c4_5_j48_h_a, newdata = holdout_age_groups.testing)
#holdout.test_levels <- min(holdout.test_rings):max(holdout.test_rings)
#dt.c4_5_j48_h_a_test_cm <- confusionMatrix(factor(dt.c4_5_j48_h_test, levels=holdout.test_levels), factor(holdout.test_rings, levels = holdout.test_levels))
dt.c4_5_j48_h_a_test_cm <- confusionMatrix(dt.c4_5_j48_h_a_test, holdout_age_groups.testing$age_group)
dt.c4_5_j48_h_a_test_cm$table
##           Reference
## Prediction young middle old
##     young    517    147  34
##     middle   119    267 138
##     old       21     33  62
dt.c4_5_j48_h_a_test_cm$overall[1]
## Accuracy 
## 0.632287
#dt.c4_5_h2
dt.c4_5_h2_a_test <- predict(dt.c4_5_h2_a, newdata = holdout_age_groups.testing)
dt.c4_5_h2_a_test_cm <- confusionMatrix(dt.c4_5_h2_a_test, holdout_age_groups.testing$age_group)
dt.c4_5_h2_a_test_cm$table
##           Reference
## Prediction young middle old
##     young    510    134  27
##     middle   120    263 123
##     old       27     50  84
dt.c4_5_h2_a_test_cm$overall[1]
##  Accuracy 
## 0.6405082
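As a sanity check, the reported accuracy is simply the diagonal of the confusion matrix divided by its total; recomputing it in base R from the train-call table above reproduces the figure:

```r
# Overall accuracy is the trace of the confusion matrix over its total.
# Values copied from the caret train-call confusion matrix above.
cm <- matrix(c(510, 134,  27,
               120, 263, 123,
                27,  50,  84),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("young", "middle", "old"),
                             Reference  = c("young", "middle", "old")))
accuracy <- sum(diag(cm)) / sum(cm)
round(accuracy, 7)   # matches the reported 0.6405082
```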

The test accuracy has improved markedly, so it is safe to say that the combination of a streamlined formula and a simpler classification requirement has paid off. This becomes the new baseline; it is now worth looking at any further improvement that can be made through the k-folds and leave-one-out cross-validation techniques.

C4.5 with cross-validation techniques

Given that the accuracy gap between the two forms of C4.5 model generation has narrowed to less than 1 percentage point, coupled with the speed at which the J48 function returns, the next phase of experimentation will use only this function together with the relevant training control options.

J48 with k-folds training

The J48 function does not accept the caret training control objects as valid control parameters; calling the Weka function WOW with "J48" as the sole argument presents the list of arguments that can be passed in to a Weka_control function call to configure training.

WOW("J48")
## -U      Use unpruned tree.
## -O      Do not collapse tree.
## -C <pruning confidence>
##         Set confidence threshold for pruning.  (default 0.25)
##  Number of arguments: 1.
## -M <minimum number of instances>
##         Set minimum number of instances per leaf.  (default 2)
##  Number of arguments: 1.
## -R      Use reduced error pruning.
## -N <number of folds>
##         Set number of folds for reduced error pruning. One fold is
##         used as pruning set.  (default 3)
##  Number of arguments: 1.
## -B      Use binary splits only.
## -S      Do not perform subtree raising.
## -L      Do not clean up after the tree has been built.
## -A      Laplace smoothing for predicted probabilities.
## -J      Do not use MDL correction for info gain on numeric
##         attributes.
## -Q <seed>
##         Seed for random data shuffling (default 1).
##  Number of arguments: 1.
## -doNotMakeSplitPointActualValue
##         Do not make split point actual value.
## -output-debug-info
##         If set, classifier is run in debug mode and may output
##         additional info to the console
## -do-not-check-capabilities
##         If set, classifier capabilities are not checked before
##         classifier is built (use with caution).
## -num-decimal-places
##         The number of decimal places for the output of numbers in
##         the model (default 2).
##  Number of arguments: 1.
## -batch-size
##         The desired batch size for batch prediction (default 100).
##  Number of arguments: 1.

Unfortunately, none of these options allows the performant J48 function to use custom resampling methods such as caret's k-folds. At this point, J48 has to be disregarded for its inflexibility, despite such good training speeds.

Train with J48 method and k-folds training

Three types of k-folds cross-validation configurations were created; at this point it’s opportune to examine which, if any, are able to improve the accuracy of training and validation. Each of the three types of k-folds configurations will be used to train the decision tree.

The important implementation detail to emphasise here is that while the holdout method required a one-off explicit call to partition the data, with the user assigning the training and testing subsets, with cross-validation the entire dataset is passed into the training call; this is because training occurs several times, each run holding out a different subsampled testing subset of the original sample. For the sake of completeness, the holdout testing subset can still be used to interrogate the model further, to allow better comparison against other classifiers.
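The fold mechanics that train handles internally can be pictured with a hand-rolled base-R sketch; this is only an illustration of how 8-fold partitioning behaves, not the code caret actually runs:

```r
# Sketch of k-fold mechanics: every observation lands in exactly one of
# k folds; each fold serves once as the held-out set while the rest train.
set.seed(42)
n <- 4019                                   # size of the full Abalone sample
k <- 8
fold <- sample(rep(1:k, length.out = n))    # random fold label per row
# For fold i: test rows are which(fold == i), training rows are the rest.
holdout_sizes <- sapply(1:k, function(i) sum(fold == i))
range(holdout_sizes)                        # 4019/8 is not exact: 502 or 503
```

Repeated cross-validation simply reruns this assignment several times with fresh shuffles and averages the resulting accuracies.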

K-folds, with 8 folds

#dt.c4_5_j48_kf_a <- train(dt.formula, method="J48", holdout_age_groups.training, tuneLength = 8, trControl = cv.train_control_8)
#abalone_data_cleansed_age_groups
dt.c4_5_j48_kf_a <- train(dt.formula, method="J48", abalone_data_cleansed_age_groups, tuneLength = 8, trControl = cv.train_control_8)

K-folds, with 8 folds, 5 repetitions

#dt.c4_5_j48_r5kf8_a <- train(dt.formula, method="J48", holdout_age_groups.training, tuneLength = 8, trControl = cv.train_control_8_5)
#abalone_data_cleansed_age_groups
dt.c4_5_j48_r5kf8_a <- train(dt.formula, method="J48", abalone_data_cleansed_age_groups, tuneLength = 8, trControl = cv.train_control_8_5)

K-folds, with 13 folds, 5 repetitions

#dt.c4_5_j48_r5kf13_a <- train(dt.formula, method="J48", holdout_age_groups.training, tuneLength = 8, trControl = cv.train_control_13_5)
#abalone_data_cleansed_age_groups
dt.c4_5_j48_r5kf13_a <- train(dt.formula, method="J48", abalone_data_cleansed_age_groups, tuneLength = 8, trControl = cv.train_control_13_5)

Training results J48 method and k-folds training

Below is a comparison of the different k-folds training results; ultimately, the best one will be picked as the preferred k-folds configuration going forward. Should this preferred configuration turn out to be too time-intensive with other types of classifier, falling back to another configuration will be considered.

K-folds, with 8 folds

dt_sum.c4_5_j48_kf_a <- summary(dt.c4_5_j48_kf_a)
dt.c4_5_j48_kf_a_final <- summary(dt.c4_5_j48_kf_a$finalModel)
#
dt_sum.c4_5_j48_kf_a
## 
## === Summary ===
## 
## Correctly Classified Instances        2771               68.9475 %
## Incorrectly Classified Instances      1248               31.0525 %
## Kappa statistic                          0.4824
## Mean absolute error                      0.2887
## Root mean squared error                  0.3799
## Relative absolute error                 70.4376 %
## Root relative squared error             83.9297 %
## Total Number of Instances             4019     
## 
## === Confusion Matrix ===
## 
##     a    b    c   <-- classified as
##  1647  268   63 |    a = young
##   402  831  124 |    b = middle
##    95  296  293 |    c = old
#
dt.c4_5_j48_kf_a_finalModel <- dt.c4_5_j48_kf_a$finalModel
dt.c4_5_j48_kf_a_finalModel_party <- as.party(dt.c4_5_j48_kf_a_finalModel)
#
length(dt.c4_5_j48_kf_a_finalModel_party)
## [1] 33
width(dt.c4_5_j48_kf_a_finalModel_party)
## [1] 17
depth(dt.c4_5_j48_kf_a_finalModel_party)
## [1] 8
#
plot(dt.c4_5_j48_kf_a_finalModel_party)

K-folds, with 8 folds, 5 repetitions

dt_sum.c4_5_r5kf8_a <- summary(dt.c4_5_j48_r5kf8_a)
dt.c4_5_j48_r5kf8_a_final <- summary(dt.c4_5_j48_r5kf8_a$finalModel)
#
dt_sum.c4_5_r5kf8_a
## 
## === Summary ===
## 
## Correctly Classified Instances        2694               67.0316 %
## Incorrectly Classified Instances      1325               32.9684 %
## Kappa statistic                          0.4385
## Mean absolute error                      0.3154
## Root mean squared error                  0.3971
## Relative absolute error                 76.9366 %
## Root relative squared error             87.7163 %
## Total Number of Instances             4019     
## 
## === Confusion Matrix ===
## 
##     a    b    c   <-- classified as
##  1696  239   43 |    a = young
##   516  746   95 |    b = middle
##   168  264  252 |    c = old
#
dt.c4_5_j48_r5kf8_a_finalModel <- dt.c4_5_j48_r5kf8_a$finalModel
dt.c4_5_j48_r5kf8_a_finalModel_party <- as.party(dt.c4_5_j48_r5kf8_a_finalModel)
#
length(dt.c4_5_j48_r5kf8_a_finalModel_party)
## [1] 39
width(dt.c4_5_j48_r5kf8_a_finalModel_party)
## [1] 20
depth(dt.c4_5_j48_r5kf8_a_finalModel_party)
## [1] 9
#
plot(dt.c4_5_j48_r5kf8_a_finalModel_party)

K-folds, with 13 folds, 5 repetitions

dt_sum.c4_5_r5kf13_a <- summary(dt.c4_5_j48_r5kf13_a)
dt.c4_5_j48_r5kf13_a_final <- summary(dt.c4_5_j48_r5kf13_a$finalModel)
#
dt_sum.c4_5_r5kf13_a
## 
## === Summary ===
## 
## Correctly Classified Instances        2731               67.9522 %
## Incorrectly Classified Instances      1288               32.0478 %
## Kappa statistic                          0.4628
## Mean absolute error                      0.2957
## Root mean squared error                  0.3845
## Relative absolute error                 72.1388 %
## Root relative squared error             84.9372 %
## Total Number of Instances             4019     
## 
## === Confusion Matrix ===
## 
##     a    b    c   <-- classified as
##  1655  264   59 |    a = young
##   436  801  120 |    b = middle
##   113  296  275 |    c = old
#
dt.c4_5_j48_r5kf13_a_finalModel <- dt.c4_5_j48_r5kf13_a$finalModel
dt.c4_5_j48_r5kf13_a_finalModel_party <- as.party(dt.c4_5_j48_r5kf13_a_finalModel)
#
length(dt.c4_5_j48_r5kf13_a_finalModel_party)
## [1] 59
width(dt.c4_5_j48_r5kf13_a_finalModel_party)
## [1] 30
depth(dt.c4_5_j48_r5kf13_a_finalModel_party)
## [1] 9
#
plot(dt.c4_5_j48_r5kf13_a_finalModel_party)

dt_sum.c4_5_j48_kf_a$details[1]
## pctCorrect 
##    68.9475
dt_sum.c4_5_r5kf8_a$details[1]
## pctCorrect 
##    67.0316
dt_sum.c4_5_r5kf13_a$details[1]
## pctCorrect 
##   67.95223

After running the training, the preliminary results suggest that neither repeating the whole k-folds process nor increasing the k count adds much training accuracy here (68.9% for the plain 8-fold run against 67.0% and 68.0% for the repeated variants). Depending on the size of a real-world dataset and the economic benefit of even the slightest improvement in accuracy, it is arguable that a trade-off between accuracy and run time can be made, and sometimes a quicker, less accurate option is the right tool for the job. That said, prediction is significantly faster than training for decision trees; once the model has been built, new data is simply passed down the tree to a class.

Given that, for k-folds with 5 repetitions, the k = 13 version took about twice as long to compute as the k = 8 version, the 8-fold version will be considered the preferred choice, assuming the validation does not suggest otherwise.

Testing results J48 method and k-folds training

K-folds, with 8 folds

dt.c4_5_j48_kf_a_test <- predict(dt.c4_5_j48_kf_a, newdata = holdout_age_groups.testing)
dt.c4_5_j48_kf_a_test_cm <- confusionMatrix(dt.c4_5_j48_kf_a_test, holdout_age_groups.testing$age_group)
dt.c4_5_j48_kf_a_test_cm$table
##           Reference
## Prediction young middle old
##     young    543    146  28
##     middle    90    257 116
##     old       24     44  90
dt.c4_5_j48_kf_a_test_cm$overall[1]
##  Accuracy 
## 0.6651719

K-folds, with 8 folds, 5 repetitions

dt.c4_5_j48_r5kf8_a_test <- predict(dt.c4_5_j48_r5kf8_a, newdata = holdout_age_groups.testing)
dt.c4_5_j48_r5kf8_a_test_cm <- confusionMatrix(dt.c4_5_j48_r5kf8_a_test, holdout_age_groups.testing$age_group)
dt.c4_5_j48_r5kf8_a_test_cm$table
##           Reference
## Prediction young middle old
##     young    565    186  47
##     middle    78    227 108
##     old       14     34  79
dt.c4_5_j48_r5kf8_a_test_cm$overall[1]
##  Accuracy 
## 0.6509716

K-folds, with 13 folds, 5 repetitions

dt.c4_5_j48_r5kf13_a_test <- predict(dt.c4_5_j48_r5kf13_a, newdata = holdout_age_groups.testing)
dt.c4_5_j48_r5kf13_a_test_cm <- confusionMatrix(dt.c4_5_j48_r5kf13_a_test, holdout_age_groups.testing$age_group)
dt.c4_5_j48_r5kf13_a_test_cm$table
##           Reference
## Prediction young middle old
##     young    547    159  32
##     middle    88    244 114
##     old       22     44  88
dt.c4_5_j48_r5kf13_a_test_cm$overall[1]
##  Accuracy 
## 0.6569507

J48 with leave-one-out training

The last C4.5 decision tree to explore would use the leave-one-out training method; however, research and experimentation suggest this is too computationally expensive for decision trees. When attempting to train a tree with this training control, over an hour passed without any output. It would appear that for leave-one-out to be practical, not only the number of attributes but also n, the total number of observations, must be very low, certainly smaller than the Abalone dataset.
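Simple arithmetic shows why: leave-one-out trains one model per observation. The per-fit timing below is a purely illustrative assumption, not a measurement:

```r
# Back-of-envelope cost of leave-one-out versus plain 8-fold cross-validation.
# The 5-second per-fit figure is an illustrative assumption.
n           <- 4019        # observations in the cleansed Abalone data
fit_seconds <- 5
loocv_fits  <- n           # LOOCV trains one model per observation
kfold_fits  <- 8           # plain 8-fold trains eight models
loocv_hours <- loocv_fits * fit_seconds / 3600
loocv_hours                # roughly 5.6 hours, before any tuning multiplier
kfold_fits * fit_seconds   # 40 seconds for the same data under 8-fold
```

Any tuning grid multiplies both figures further, which is consistent with the run that was abandoned after an hour.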

To conclude this section: leave-one-out cross-validation is abandoned as impractical for a dataset of this size, and k-folds with 8 folds stands as the preferred cross-validation configuration for the classifiers that follow.

Random Forest Decision Tree

A random forest builds an ensemble of decision trees, each trained on a bootstrap sample of the data with a random subset of attributes considered at each split; the final classification is decided by majority vote across the trees, which tends to reduce the over-fitting seen with a single tree.

Random Forest Holdout

Random Forest Holdout training

dt.rf_h_a <- train(dt.formula, data = holdout_age_groups.training, method = "rf", prox= TRUE)
## note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
dt.rf_h_a_finalModel <- dt.rf_h_a$finalModel
dt_sum.rf_h_a <- summary(dt.rf_h_a)
#
dt.rf_h_a_finalModel$forest[11]
## $nrnodes
## [1] 1559
dt.rf_h_a_finalModel$forest[12]
## $ntree
## [1] 500
#
print(dt.rf_h_a)
## Random Forest 
## 
## 2681 samples
##    3 predictors
##    3 classes: 'young', 'middle', 'old' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 2681, 2681, 2681, 2681, 2681, 2681, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.6078591  0.3560456
##   3     0.6026535  0.3492116
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
print(dt.rf_h_a_finalModel)
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 39.16%
## Confusion matrix:
##        young middle old class.error
## young    994    261  66   0.2475397
## middle   289    467 154   0.4868132
## old       74    206 170   0.6222222
plot(dt.rf_h_a_finalModel)

Random Forest Holdout testing

dt.rf_h_a_test <- predict(dt.rf_h_a, newdata = holdout_age_groups.testing)

dt.rf_h_a_test_cm <- confusionMatrix(dt.rf_h_a_test, holdout_age_groups.testing$age_group)
dt.rf_h_a_test_cm$table
##           Reference
## Prediction young middle old
##     young    503    151  37
##     middle   118    218 115
##     old       36     78  82
dt.rf_h_a_test_cm$overall[1]
##  Accuracy 
## 0.6001495

Random Forest Cross-validation

Cross-validation is combined here with the random forest's own safeguards against over-fitting: each tree sees only a bootstrap sample of the data, and the out-of-bag (OOB) error provides an internal estimate of generalisation performance.

Random Forest Cross-validation training

dt.rf_cv_r5kf8_a <- train(dt.formula, data = abalone_data_cleansed_age_groups, method = "rf", prox= TRUE, trControl = cv.train_control_8_5)
## note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
dt.rf_cv_r5kf8_a_finalModel <- dt.rf_cv_r5kf8_a$finalModel
dt_sum.rf_cv_r5kf8_a <- summary(dt.rf_cv_r5kf8_a)
#
dt.rf_cv_r5kf8_a_finalModel$forest[11]
## $nrnodes
## [1] 2343
dt.rf_cv_r5kf8_a_finalModel$forest[12]
## $ntree
## [1] 500
#
print(dt.rf_cv_r5kf8_a)
## Random Forest 
## 
## 4019 samples
##    3 predictors
##    3 classes: 'young', 'middle', 'old' 
## 
## No pre-processing
## Resampling: Cross-Validated (8 fold, repeated 5 times) 
## Summary of sample sizes: 3517, 3515, 3518, 3515, 3517, 3516, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.6021386  0.3466302
##   3     0.5988570  0.3424393
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
print(dt.rf_cv_r5kf8_a_finalModel)
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 39.86%
## Confusion matrix:
##        young middle old class.error
## young   1483    406  89   0.2502528
## middle   437    679 241   0.4996315
## old      113    316 255   0.6271930
plot(dt.rf_cv_r5kf8_a_finalModel)

Random Forest Cross-validation testing

Based on the previous performances for k-folds cross-validation, only one configuration has been used for the random forest: the one considered the happy medium of the original three (8 folds, 5 repetitions).

dt.rf_cv_r5kf8_a_test <- predict(dt.rf_cv_r5kf8_a, newdata = holdout_age_groups.testing)

dt.rf_cv_r5kf8_a_test_cm <- confusionMatrix(dt.rf_cv_r5kf8_a_test, holdout_age_groups.testing$age_group)
dt.rf_cv_r5kf8_a_test_cm$table
##           Reference
## Prediction young middle old
##     young    650      4   3
##     middle     5    440   4
##     old        2      3 227
dt.rf_cv_r5kf8_a_test_cm$overall[1]
##  Accuracy 
## 0.9843049

The reported accuracy of 0.984 should be treated with caution. This model was trained on the full dataset (abalone_data_cleansed_age_groups), which includes the instances in holdout_age_groups.testing, so the test set here is not unseen data. The out-of-bag error estimate of 39.86% is a more honest indication of generalisation, and it is consistent with the holdout random forest above.
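Why evaluating on instances the model has already seen inflates accuracy can be illustrated with a small synthetic example (base R, hypothetical data, not the Abalone set): a 1-nearest-neighbour classifier "trained" on the full dataset scores perfectly on any subset of that same data, because each test point's nearest neighbour is itself.

```r
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)   # 100 synthetic instances, 2 features
y <- factor(sample(c("young", "middle", "old"), 100, replace = TRUE))
test_idx <- 1:30                    # a "holdout" drawn from the same data
# 1-NN over the full data: each test point's nearest neighbour is itself
nearest <- sapply(test_idx, function(i) {
  d <- rowSums((x - matrix(x[i, ], 100, 2, byrow = TRUE))^2)
  which.min(d)
})
mean(y[nearest] == y[test_idx])     # 1: a perfect, but meaningless, score
```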

Random Forest Leave-one-out

As with the C4.5 decision tree, leave-one-out cross-validation is not computationally feasible for this type of model.

Decision Tree Conclusion

Of the decision-tree models evaluated on genuinely held-out data, the C4.5 trees reached roughly 0.65–0.67 accuracy, while the holdout random forest reached roughly 0.60. Random forests are designed to avoid over-fitting through bagging and random feature selection, and this can improve real-world accuracy; on this dataset, however, with only three predictors in the formula, that advantage did not translate into higher test accuracy than a single pruned tree.

Task 4: Build Train and Test a Naïve Bayes type Classifier

Premise

You need to construct, train and test a Naïve Bayes type classifier in R. Train and test your Naïve Bayes classifier using the training and test sets generated based on the methods tried as part of the 2nd Task.

Naive Bayes is a fast, simple probabilistic classifier that is often used as a benchmark for other forms of classification: a proposed technique is worth sharing if it beats Naive Bayes in speed or accuracy. If it cannot, it either needs further work, is only of use for a very niche problem, or should be set aside completely.
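A minimal sketch of the underlying calculation, using synthetic one-feature data rather than the Abalone set: per class, fit a normal distribution to each feature, then score a new instance by the class prior multiplied by the per-feature likelihoods (the "naive" step is assuming the features are independent given the class).

```r
set.seed(7)
train <- data.frame(
  height = c(rnorm(50, mean = 0.10, sd = 0.02),   # synthetic "young"
             rnorm(50, mean = 0.16, sd = 0.02)),  # synthetic "old"
  class  = factor(rep(c("young", "old"), each = 50))
)
# Unnormalised posterior: class prior * Gaussian likelihood of the feature
posterior <- function(h) {
  sapply(levels(train$class), function(cl) {
    x <- train$height[train$class == cl]
    mean(train$class == cl) * dnorm(h, mean(x), sd(x))
  })
}
names(which.max(posterior(0.10)))  # classified as "young"
```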

Naive Bayes training

Naive Bayes formula and simpler classification

Precedent was set in the previous task to use a specific formula to target key predictors (weight.shell, height, diameter) and classify data instances based on a factor representing simplified age grouping; that precedent applies to the modelling in this chapter and onwards.

Naive Bayes with holdout

nb.h_a <- NaiveBayes(dt.formula, data=holdout_age_groups.training)
plot(nb.h_a)

Naive Bayes with holdout, testing results

nb.h_a_test <- predict(nb.h_a, holdout_age_groups.testing)
# summarize results
nb.h_a_test_cm <- confusionMatrix(nb.h_a_test$class, holdout_age_groups.testing$age_group)
nb.h_a_test_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction young middle old
##     young    468    121  45
##     middle   186    293 149
##     old        3     33  40
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5987          
##                  95% CI : (0.5718, 0.6251)
##     No Information Rate : 0.491           
##     P-Value [Acc > NIR] : 1.811e-15       
##                                           
##                   Kappa : 0.3318          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: young Class: middle Class: old
## Sensitivity                0.7123        0.6555     0.1709
## Specificity                0.7562        0.6240     0.9674
## Pos Pred Value             0.7382        0.4666     0.5263
## Neg Pred Value             0.7315        0.7831     0.8463
## Prevalence                 0.4910        0.3341     0.1749
## Detection Rate             0.3498        0.2190     0.0299
## Detection Prevalence       0.4738        0.4694     0.0568
## Balanced Accuracy          0.7343        0.6397     0.5692

Naive Bayes with Repeated K-folds Cross-validation

nb.r5kf8_a <- train(dt.formula, data=abalone_data_cleansed_age_groups, trControl = cv.train_control_8_5, method="nb")
plot(nb.r5kf8_a)

Naive Bayes with Repeated K-folds, testing results

nb.r5kf8_a_test <- predict(nb.r5kf8_a, holdout_age_groups.testing)
# summarize results
nb.r5kf8_a_test_cm <- confusionMatrix(nb.r5kf8_a_test, holdout_age_groups.testing$age_group)
nb.r5kf8_a_test_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction young middle old
##     young    481    128  53
##     middle   174    288 145
##     old        2     31  36
## 
## Overall Statistics
##                                          
##                Accuracy : 0.6016         
##                  95% CI : (0.5748, 0.628)
##     No Information Rate : 0.491          
##     P-Value [Acc > NIR] : 3.029e-16      
##                                          
##                   Kappa : 0.3321         
##  Mcnemar's Test P-Value : < 2.2e-16      
## 
## Statistics by Class:
## 
##                      Class: young Class: middle Class: old
## Sensitivity                0.7321        0.6443    0.15385
## Specificity                0.7342        0.6420    0.97011
## Pos Pred Value             0.7266        0.4745    0.52174
## Neg Pred Value             0.7396        0.7825    0.84397
## Prevalence                 0.4910        0.3341    0.17489
## Detection Rate             0.3595        0.2152    0.02691
## Detection Prevalence       0.4948        0.4537    0.05157
## Balanced Accuracy          0.7332        0.6431    0.56198

Naive Bayes with Leave-one-out Cross-validation

nb.loo_a <- train(dt.formula, data=abalone_data_cleansed_age_groups, trControl = loocv.train_control, method="nb")
plot(nb.loo_a)

Naive Bayes with Leave-one-out, testing results

nb.loo_a_test <- predict(nb.loo_a, holdout_age_groups.testing)
# summarize results
nb.loo_a_test_cm <- confusionMatrix(nb.loo_a_test, holdout_age_groups.testing$age_group)
nb.loo_a_test_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction young middle old
##     young    481    128  53
##     middle   174    288 145
##     old        2     31  36
## 
## Overall Statistics
##                                          
##                Accuracy : 0.6016         
##                  95% CI : (0.5748, 0.628)
##     No Information Rate : 0.491          
##     P-Value [Acc > NIR] : 3.029e-16      
##                                          
##                   Kappa : 0.3321         
##  Mcnemar's Test P-Value : < 2.2e-16      
## 
## Statistics by Class:
## 
##                      Class: young Class: middle Class: old
## Sensitivity                0.7321        0.6443    0.15385
## Specificity                0.7342        0.6420    0.97011
## Pos Pred Value             0.7266        0.4745    0.52174
## Neg Pred Value             0.7396        0.7825    0.84397
## Prevalence                 0.4910        0.3341    0.17489
## Detection Rate             0.3595        0.2152    0.02691
## Detection Prevalence       0.4948        0.4537    0.05157
## Balanced Accuracy          0.7332        0.6431    0.56198

Accuracy comparison of NB models

nb.h_a_test_cm$overall[1]
##  Accuracy 
## 0.5986547
nb.r5kf8_a_test_cm$overall[1]
##  Accuracy 
## 0.6016442
nb.loo_a_test_cm$overall[1]
##  Accuracy 
## 0.6016442

Task 5: Build Train and Test a K-NN type Classifier

Premise

You need to construct, train and test K-NN type classifier in R. Train and test your K-NN classifier using the training and test sets generated based on the methods tried as part of the 2nd Task.

K-nearest neighbours (KNN) is an instance-based, non-parametric classifier: a new instance is assigned the majority class among its k closest training instances, typically measured by Euclidean distance. There is no real training phase beyond storing the data, so the computational cost is paid at prediction time, and the choice of k controls the bias–variance trade-off.
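A minimal base-R sketch of the idea on synthetic two-cluster data (the helper knn_predict is hypothetical, not the caret implementation used below):

```r
set.seed(3)
train_x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),  # "young" cluster
                 matrix(rnorm(40, mean = 3), ncol = 2))  # "old" cluster
train_y <- factor(rep(c("young", "old"), each = 20))
# Majority vote among the k nearest training points (Euclidean distance)
knn_predict <- function(p, k = 5) {
  d <- sqrt(rowSums((train_x - matrix(p, nrow(train_x), 2, byrow = TRUE))^2))
  names(which.max(table(train_y[order(d)[1:k]])))
}
knn_predict(c(0, 0))  # falls in the "young" cluster
knn_predict(c(3, 3))  # falls in the "old" cluster
```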

KNN modelling

KNN with holdout

knn.h_a <- train(dt.formula, data=holdout_age_groups.training, method = "knn")
plot(knn.h_a)

KNN with holdout, testing

knn.h_a_test <- predict(knn.h_a, holdout_age_groups.testing)
# summarize results
knn.h_a_test_cm <- confusionMatrix(knn.h_a_test, holdout_age_groups.testing$age_group)
knn.h_a_test_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction young middle old
##     young    526    159  33
##     middle   110    233 130
##     old       21     55  71
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6203          
##                  95% CI : (0.5937, 0.6464)
##     No Information Rate : 0.491           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3664          
##  Mcnemar's Test P-Value : 4.017e-09       
## 
## Statistics by Class:
## 
##                      Class: young Class: middle Class: old
## Sensitivity                0.8006        0.5213    0.30342
## Specificity                0.7181        0.7306    0.93116
## Pos Pred Value             0.7326        0.4926    0.48299
## Neg Pred Value             0.7887        0.7526    0.86314
## Prevalence                 0.4910        0.3341    0.17489
## Detection Rate             0.3931        0.1741    0.05306
## Detection Prevalence       0.5366        0.3535    0.10987
## Balanced Accuracy          0.7593        0.6259    0.61729

KNN with Repeated K-folds Cross-validation

knn.r5kf8_a <- train(dt.formula, data=abalone_data_cleansed_age_groups, trControl = cv.train_control_8_5, method = "knn")
plot(knn.r5kf8_a)

KNN with Repeated K-folds Cross-validation, testing

knn.r5kf8_a_test <- predict(knn.r5kf8_a, holdout_age_groups.testing)
# summarize results
knn.r5kf8_a_test_cm <- confusionMatrix(knn.r5kf8_a_test, holdout_age_groups.testing$age_group)
knn.r5kf8_a_test_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction young middle old
##     young    550    130  37
##     middle    91    276  98
##     old       16     41  99
## 
## Overall Statistics
##                                          
##                Accuracy : 0.6913         
##                  95% CI : (0.6658, 0.716)
##     No Information Rate : 0.491          
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.4859         
##  Mcnemar's Test P-Value : 2.133e-08      
## 
## Statistics by Class:
## 
##                      Class: young Class: middle Class: old
## Sensitivity                0.8371        0.6174    0.42308
## Specificity                0.7548        0.7879    0.94837
## Pos Pred Value             0.7671        0.5935    0.63462
## Neg Pred Value             0.8277        0.8041    0.88579
## Prevalence                 0.4910        0.3341    0.17489
## Detection Rate             0.4111        0.2063    0.07399
## Detection Prevalence       0.5359        0.3475    0.11659
## Balanced Accuracy          0.7960        0.7027    0.68572

KNN with Leave-one-out Cross-validation

knn.loo_a <- train(dt.formula, data=abalone_data_cleansed_age_groups, trControl = loocv.train_control, method="knn")
plot(knn.loo_a)

KNN with Leave-one-out, testing

knn.loo_a_test <- predict(knn.loo_a, holdout_age_groups.testing)
# summarize results
knn.loo_a_test_cm <- confusionMatrix(knn.loo_a_test, holdout_age_groups.testing$age_group)
knn.loo_a_test_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction young middle old
##     young    548    132  36
##     middle    92    277  99
##     old       17     38  99
## 
## Overall Statistics
##                                          
##                Accuracy : 0.6906         
##                  95% CI : (0.665, 0.7153)
##     No Information Rate : 0.491          
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.4845         
##  Mcnemar's Test P-Value : 6.183e-09      
## 
## Statistics by Class:
## 
##                      Class: young Class: middle Class: old
## Sensitivity                0.8341        0.6197    0.42308
## Specificity                0.7533        0.7856    0.95018
## Pos Pred Value             0.7654        0.5919    0.64286
## Neg Pred Value             0.8248        0.8046    0.88598
## Prevalence                 0.4910        0.3341    0.17489
## Detection Rate             0.4096        0.2070    0.07399
## Detection Prevalence       0.5351        0.3498    0.11510
## Balanced Accuracy          0.7937        0.7027    0.68663

Accuracy comparison of KNN models

knn.h_a_test_cm$overall[1]
##  Accuracy 
## 0.6203288
knn.r5kf8_a_test_cm$overall[1]
##  Accuracy 
## 0.6913303
knn.loo_a_test_cm$overall[1]
## Accuracy 
## 0.690583

Task 6: Measure Performance

Premise

For each type of classifier calculate and display the following performance-related metrics in R, using the ROCR library (library(ROCR)).

TBD

References